Neuron Recycling

Authors
Affiliations

Jakub Krajewski

University of Warsaw

IDEAS NCBR

Maciej Pióro

University of Warsaw

IDEAS NCBR

Sebastian Jaszczur

University of Warsaw

IDEAS NCBR

Published

July 5, 2023

Sparse neural networks have garnered attention due to their theoretical promise of lowered computational demands and memory savings. However, to this date, the theoretical gains have largely failed to materialize due to the lack of hardware support for this kind of models. In this work, we explore the idea of neuron recycling which is inspired by pruning - a method often employed to induce sparsity in neural networks. We also present lessons we have learned along the way.

Introduction

Pruning is a well-established technique used to sparsify neural networks. It relies on the fact that typically a large part of a trained neural network can be masked without impacting the accuracy of the network, albeit often requiring additional fine-tuning in order to regain some lost performance. Despite multiple works proposing various neuron-selection criteria for pruning, magnitude-based pruning remains a viable option. . The main point of the LTH is that through iterative pruning, performant subnetworks depending on the initialization can be found in neural networks. Those well-initialized network fragments are the namesake of LTH (the “lottery tickets”). By combining the two ideas (pruning and LTH) we arrive at a new potential technique for raising neural network performance. If we are able to remove parts of the network without hurting the performance (pruning) and the fitness of a part of a network is determined at initialization, perhaps we could re-initialize the unnecessary network parts (i. e. draw more “lottery tickets”).

Understanding neuron magnitudes

One of the first natural questions we have asked ourselves, as we already knew that we wanted to take into consideration neuron magnitudes, was how does this metric behave and evolve during the training process. Intuitively, the magnitude of a neuron correlates with its significance. At first, we thought a histogram of neuron magnitudes would exhibit a normal distribution. However, our findings showed something different - the distribution follows a Pareto-like shape, with a majority of almost-zero or small values, as seen in the plot below.


This discovery raised several questions. Are the majority of neurons with very small magnitudes insignificant? Can they be dropped? Or could it be that these “small” neurons are activated rarely, but are crucial for certain tasks? Alternatively, perhaps when combined, they contribute significantly to the overall performance? Although we initially expected a different distribution, we wanted to explore and find the “ideal” distribution of neurons.

As a side note: interestingly, we noticed that this discrepancy does not happen in the last layers of the network. As an example, below you can examine magnitudes for neurons in the 8th FF layer of the network.


In order to find the ideal neuron distribution, we decided to give special attention to the feed-forward (FF) component of the network. We periodically froze all parts of the network except for the FF component and continued to train it for several batches of data. We hypothesized that in this scenario, weaker neurons might “catch up,” resulting in a more even distribution. The results of this experiment can be seen in the following plot.

We have also examined the scenario of retraining only small magnitude neurons, only large magnitude neurons and random subsets. How does it affect the performance? The results are depicted on the following plot.

[plot will be attached here]

Interestingly, retraining only the smallest neurons yields the best results when compared to reinforcing high-magnitude neurons or random subsets. This provided a compelling argument in favor of our technique. However, it is important to note that these experiments were based on pretraining relatively small, BERT-based models. We were curious to see how our observations would translate to well-established, large-scale foundation models like BERT, T5, and GPT-2.

[plot - magnitudes in foundation models]

Upon examining the magnitudes in these foundation models, noticeable differences emerge. The magnitudes in T5 seem similar to those in our smaller models, while BERT and GPT-2 display more favorable distributions. What could account for these variations? We discovered that the use of weight decay plays a significant role. This simple but widely used technique has a considerable impact on the distribution phenomenon we’ve been investigating.

These findings support the idea of exploring neuron recycling more thoroughly and offer a solid foundation for further experiments. In subsequent sections, we will delve into the results of these investigations and share our insights.

Recycling

The central part of our work was a method we called neuron recycling. The whole process boils down to three phases, repeated periodically: training, selection and reinitialization.

  • In the training phase, the model is trained to predict masked tokens (masked language modelling).
  • In the selection phase, the least important neurons are determined, where the baseline criterion is neuron magnitude
  • In the reinitialization phase, new weights are assigned to neurons.

Although this procedure is conceptually simple, it allows for many degrees of freedom. Here are some choices that can be made:

  • The number of training steps before consecutive selection / reinitialization phases
  • The percentage of recycled neurons
  • Selection / reinitialization strategies

After examining the pruning literature [citations], and also based on our own experiments, we have found that the simple magnitude-based approach works best in most cases . It is also easy to implement and computationally efficient.

<wykres: pruning random/małe/duże/brak ff-a od początku>

It has been shown , that the initialization of the weights in neural networks is of utmost importance to the convergence of the network. In all our experiments, we initialized the linear layers using the following distribution: as per .

Baseline recycling

The most straightforward reinitialization scheme is to sample the weights of the reinitialized neurons from the initial distribution. After examining the performance of this solution, we could not see any difference between recycling and vanilla training. As a sanity check, we have examined the histogram presenting the number of times each neuron was recycled, discovering that the same small subset of neurons was being reinitialized over and over during training. As we have seen , on average magnitude of neurons grows throughout the training. Therefore, the newly sampled neurons have, on average, lower magnitudes than non-recycled neurons and are unable to catch up to non-recycled neurons before another selection phase. Thus, the recycled neurons are caught up in a vicious cycle in which they are always recycled before achieving high magnitude.

<spróbować zrobić diagram z vicious cycle>

Immunity

Another thing we tried to make reinitialization work was Immunity. The idea here is to encourage diverse recycling by making each recycled neuron immune to further recycling for some predefined number of steps. We hypothesized that a reinitialized neuron needs some time to grow, but in our initial setting, the neuron is recycled before enough growth happens, keeping the smaller neurons in a vicious cycle of periodically reoccurring reinitialization. Unfortunately, in our experiments, the reinitialized neurons failed to grow even if given immunity. At the same time immunity of the small neurons lead to recycling of the large neurons, hurting the model’s performance.

Modifying reinitialization weights distribution

As we have pointed out before, magnitude and weight distribution drifts away from the initial distribution as the training progresses. However, during our initial attempts, we initialized the weights sampling from the initial distribution. To fix this issue, we decided to try out another weight sampling technique. A feed-forward layer consists of two matrices. Each time a neuron is reinitialized, the corresponding row and column in the matrices are filled with new (reinitialized) weights. The new technique, when replacing the weights, used the normal distribution with mean and standard deviation equal to the mean and standard deviation of all the weights in the respective matrix. This approach, like immunity, eliminated the vicious cycle. However, this process introduced a lot of noise with adverse effect on the model’s loss.

At this stage of our project, we gained interest in the topic of growing neural networks. Intuitively, this problem has a similar part, in which we want to add new neurons or weights. In the case of Large Language Models, the topic is mentioned in [Gopher]. The authors describe that for their experiments, copying existing neurons was the best strategy. We have also tried this strategy, but without positive results.

Starając się wyjaśnić nasze dotychczasowe problemy, zwróciliśmy uwagę na dyskretną naturę naszej techniki, as opposed to the continuous landscape of training/optimizers. To change this, we have modified our strategy to gradually change the value by linear interpolation, according to the rule

<latex ze wzorkiem; może jakiś plot z naszymi górkami>,

where was frozen and trainable. We have noticed that this way we were able to make training loss dynamics smoother, but still without any positive effect.

In general, we couldn’t beat the baseline in terms of performance.

Other ideas we researched

Midpoint Loss

While working towards the main goal of the project, we started investigating the distribution of neuron magnitudes during the training [link do sekcji z blogposta]. We noticed that the magnitude distribution is quite uneven - a large percentage of neurons remains small, and the distribution is right-skewed. Since the goal of our project was to reduce the number of low-quality, i.e., small neurons, we came up with a pretty risky solution: Midpoint Loss. The idea was to introduce an additional loss term that would encourage neuron growth and ``even-ness’’ of the magnitude distribution. The general equation for the midpoint loss is:

\[ Loss = \sum_{l = 1}^{L} \sum_{n = 1}^{dff} dist(M_{l,n}, stop\_grad(midpoint(M_{l,*}))), \] where:

  • \(M_{l,n}\) -magnitude of then-th neuron in the l-th layer \(dist\) - some distance function, typically\(l_1\),\(l_2\) \(midpoint\) -mean or median. Median can be used instead of mean because of itsrobustness with respect

Since this idea is quite similar to weight decay, we decided not to optimize this term with Adam, but to split it from the task loss and optimize it using simple gradient descent - a similar technique is used in AdamW to incorporate weight decay loss term. Since we weren;c

<plot, że iwd zmienia dystrybucję magnitude> <wariacje nt. Plotu ze strzałkami tłumaczące różne opcje w immunity>

Takeaways

Takeaways from the project. Test test2